April 12, 2018

Reproducibility: who cares?

Science retracts gay marriage paper without agreement of lead author LaCour

  • In May 2015 Science retracted a study of how canvassers can sway people's opinions about gay marriage published just 5 months earlier.

  • Science Editor-in-Chief Marcia McNutt gave three reasons: the original survey data were not made available for independent reproduction of results, survey incentives were misrepresented, and the sponsorship statement was false.

  • Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.

  • Methods we'll discuss today can't prevent this, but they can make it easier to discover issues.

Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent

Bad spreadsheet merge kills depression paper, quick fix resurrects it

Divorce study felled by a coding error gets a second chance

Divorce study retraction: Editor's note

  • "The research environment is fast-paced given the ethos to “publish or perish.”"

  • "[…] research is becoming increasingly complex, with greater calls for transdisciplinary collaborations, “big data,” and more sophisticated research questions and methods […] data sets often have multiple files that require merging, change the wording of questions over time, provide incomplete codebooks, and have unclear and sometimes duplicative variables."

  • "Given these issues, I would not be surprised if coding errors were fairly common […]"



Source: http://retractionwatch.com/2015/09/10/divorce-study-felled-by-a-coding-error-gets-a-second-chance/#more-32151

Reproducibility: why should you care?

Think back to every time…

  • The results in Table 1 don't seem to correspond to those in Figure 2.
  • In what order do I run these scripts?
  • Where did we get this data file?
  • Why did I omit those samples?
  • How did I make that figure?
  • "Your script is now giving an error."
  • "The attached is similar to the code we used."



Source: Karl Broman

No collaborators?





Your closest collaborator is you six months ago,
but you don’t reply to emails.

- Mark Holder




Reproducibility: how?

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other data?
  • Can you extend the code to do other things?
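
One way to approach the second question ("Does the code actually do what you think it does?") is to state an expectation in code right after a risky step such as a merge. The following is a minimal sketch on made-up toy tables; `participants`, `scores`, and every value in them are hypothetical.

```r
# Toy tables invented for illustration; not data from any study mentioned here.
participants <- data.frame(id = c(1, 2, 3), age = c(34, 29, 41))
scores <- data.frame(id = c(1, 2, 2, 3), score = c(10, 12, 13, 9))  # id 2 appears twice

merged <- merge(participants, scores, by = "id")

# If each participant should have exactly one score, the merge must not add rows.
merge_is_one_to_one <- nrow(merged) == nrow(participants)

if (!merge_is_one_to_one) {
  message("Merge changed the number of rows: ",
          nrow(participants), " -> ", nrow(merged))
}
```

In a real script, `stopifnot(merge_is_one_to_one)` would halt the analysis at this point instead of letting the duplicated rows flow into the results.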

Ambitious goal + many other concerns

We need an environment where

  • data, analysis, and results are tightly connected, or better yet, inseparable

  • reproducibility is built in
    • the original data remains untouched
    • all data manipulations and analyses are inherently documented
  • documentation is human readable and syntax is minimal
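
As a rough illustration of "the original data remains untouched", the sketch below only ever reads the raw file and writes the cleaned version to a separate file. The temporary paths, the toy data, and the age threshold are all assumptions made for this example.

```r
# Hypothetical layout: a raw file that is only read, and a derived file that is written.
# Temporary paths stand in for folders such as raw-data/ and a separate output folder.
raw_path <- tempfile(fileext = ".csv")
clean_path <- tempfile(fileext = ".csv")

# Made-up raw data with one implausible age.
write.csv(data.frame(id = 1:4, age = c(25, 31, 999, 18)), raw_path, row.names = FALSE)
raw_checksum <- unname(tools::md5sum(raw_path))

# Every manipulation lives in code, so it is documented by construction.
raw <- read.csv(raw_path)
clean <- raw[raw$age <= 120, ]  # drop implausible ages (threshold is an assumption)
write.csv(clean, clean_path, row.names = FALSE)

# The raw file has the same checksum after the analysis as before it.
raw_untouched <- identical(unname(tools::md5sum(raw_path)), raw_checksum)
```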

Toolkit

Outline

  1. Scriptability \(\rightarrow\) R

  2. Literate programming \(\rightarrow\) R Markdown

  3. Version control \(\rightarrow\) Git / GitHub

1. Scriptability

Point-and-click vs. scripting

  • Learning curve: Point-and-click software (supposedly) has a shallower learning curve than scripting languages

  • Documentation: At a minimum, your code documents your analysis
    • And you can do better with comments and README files
  • Automation: Need to rerun your analysis with new/updated data? Just change the input file.

  • Collaboration: Sharing your analysis is as easy as sharing your scripts
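
The automation point can be sketched by wrapping an analysis in a function of its input file. The function `summarise_ages()` and both toy files are hypothetical names and data invented for this example.

```r
# A hypothetical analysis wrapped in a function of its input file.
summarise_ages <- function(input_file) {
  dat <- read.csv(input_file)
  c(n = nrow(dat), mean_age = mean(dat$age))
}

# Two made-up data files standing in for the original and the updated data.
old_file <- tempfile(fileext = ".csv")
new_file <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(20, 30)), old_file, row.names = FALSE)
write.csv(data.frame(age = c(20, 30, 40)), new_file, row.names = FALSE)

old_result <- summarise_ages(old_file)
new_result <- summarise_ages(new_file)  # same code, new input
```

Rerunning on updated data changes one argument and nothing else, which is hard to guarantee with a sequence of manual clicks.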

Why R?

  • Programming language for data analysis
  • Free!
  • Open source
  • Widely used and supported across many disciplines
  • Can be used on Windows, Mac OS X, or Linux
  • Thousands of statistical data analysis packages


Why not language X?

  • There are a number of other great programming tools out there that can also be used to improve the reproducibility of your analysis

  • The key is to use some type of language that will allow you to automate and document your analysis

  • Once you master one language you'll probably find it easier to learn another

Once in R

You could just type into the command prompt, but that doesn't help much with

  • documentation

or

  • automation


2. Literate programming

Donald Knuth, "Literate Programming" (1983)

"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

"The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other."

  • These ideas have been around for years!
  • and tools for putting them into practice have also been around
  • but they have never been as accessible as the current tools

A better solution than just R

With RStudio you can combine your programming and your documentation

  • Gives you a single environment to combine your documentation and your analysis
  • Runs on top of R


What is Markdown?

  • Markdown is a lightweight markup language for creating HTML (or XHTML) documents.

  • Markup languages are designed to produce documents from human readable text (and annotations).

  • Some of you may be familiar with LaTeX. This is another (less human-friendly) markup language for creating PDF documents.

  • Why I love Markdown:
    • Simple syntax means easy to learn and use.
    • Focus on content, rather than coding and debugging errors.
    • Allows for easy web authoring.
    • Once you have the basics down, you can get fancy and add HTML, JavaScript, and CSS.

Sample Markdown document

[Figure: sample Markdown document]

What is R Markdown?

Well, it's R + Markdown:

  • Ease of Markdown syntax

  • Rendering of R code to produce output and plots

  • Ability to include LaTeX: \(\hat{y} = \beta_0 + \beta_1 \times x\)

Sample R Markdown document

[Figure: sample R Markdown document]

Another R Markdown document





This presentation!





Example: Big Five Personality Test

The Big Five personality traits are five broad dimensions used by some psychologists to describe the human personality and psyche: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism.



Load data with an R chunk:

library(tidyverse)  # provides read_delim(), filter(), and ggplot()

big5 <- read_delim("raw-data/big5.txt", delim = "\t")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   country = col_character()
## )
## See spec(...) for full column specifications.



Sources: Wikipedia and http://personality-testing.info/_rawdata/.

Under the hood

[Figure: R chunk]

View data

big5
## # A tibble: 19,719 x 57
##     race   age engnat gender  hand source country    E1    E2    E3    E4
##    <int> <int>  <int>  <int> <int>  <int> <chr>   <int> <int> <int> <int>
##  1     3    53      1      1     1      1 US          4     2     5     2
##  2    13    46      1      2     1      1 US          2     2     3     3
##  3     1    14      2      2     1      1 PK          5     1     1     4
##  4     3    19      2      2     1      1 RO          2     5     2     4
##  5    11    25      2      2     1      2 US          3     1     3     3
##  6    13    31      1      2     1      2 US          1     5     2     4
##  7     5    20      1      2     1      5 US          5     1     5     1
##  8     4    23      2      1     1      2 IN          4     3     5     3
##  9     5    39      1      2     3      4 US          3     1     5     1
## 10     3    18      1      2     1      5 US          1     4     2     5
## # ... with 19,709 more rows, and 46 more variables: E5 <int>, E6 <int>,
## #   E7 <int>, E8 <int>, E9 <int>, E10 <int>, N1 <int>, N2 <int>, N3 <int>,
## #   N4 <int>, N5 <int>, N6 <int>, N7 <int>, N8 <int>, N9 <int>, N10 <int>,
## #   A1 <int>, A2 <int>, A3 <int>, A4 <int>, A5 <int>, A6 <int>, A7 <int>,
## #   A8 <int>, A9 <int>, A10 <int>, C1 <int>, C2 <int>, C3 <int>, C4 <int>,
## #   C5 <int>, C6 <int>, C7 <int>, C8 <int>, C9 <int>, C10 <int>, O1 <int>,
## #   O2 <int>, O3 <int>, O4 <int>, O5 <int>, O6 <int>, O7 <int>, O8 <int>,
## #   O9 <int>, O10 <int>

Clean data

You can include script files in your R Markdown document

source("code/01-data-cleanup.R")
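
The contents of `code/01-data-cleanup.R` are not reproduced here. Purely as an illustration of the kind of work such a script might do, the sketch below recodes `gender` and builds naive trait scores on a toy data frame. The gender coding, the plain-sum scoring, and the toy values are all assumptions, not the author's method.

```r
# Toy stand-in for the raw data: two items per trait instead of ten.
toy <- data.frame(
  gender = c(1, 2, 3, 2),
  E1 = c(4, 2, 5, 3), E2 = c(2, 2, 1, 4),
  N1 = c(1, 5, 3, 2), N2 = c(3, 4, 2, 2)
)

# Assumed coding (check the codebook): 1 = Male, 2 = Female, 3 = Other.
toy$gender <- factor(toy$gender, levels = c(2, 1, 3),
                     labels = c("Female", "Male", "Other"))

# Naive trait scores: plain sums of the item columns.
# A real scoring scheme may reverse-key some items first.
toy$extraversion <- rowSums(toy[, grep("^E[0-9]+$", names(toy))])
toy$neuroticism  <- rowSums(toy[, grep("^N[0-9]+$", names(toy))])
```

Keeping steps like these in a script, rather than editing the raw file by hand, is what lets `source()` replay them on demand.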

[Figure: script]

View distribution of age

ggplot(big5, aes(x = age)) +
  geom_histogram()

summary(big5$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   18.00   22.00   26.26   31.00   99.00

Regress extraversion on neuroticism and gender

Extraversion: Seeking fulfillment from sources outside the self or in community. High scorers are social, low scorers prefer to work alone.

Neuroticism: Being emotional.

library(broom)  # provides tidy()

# Named for its predictor: this model uses neuroticism, not age
m_ext_neu <- lm(extraversion ~ neuroticism * gender, data = big5)
tidy(m_ext_neu)
##                      term     estimate   std.error   statistic
## 1             (Intercept) 15.202757998 0.190240289 79.91345094
## 2             neuroticism  0.297345720 0.009614958 30.92532635
## 3              genderMale -1.893016800 0.327307851 -5.78359728
## 4             genderOther -5.721793849 2.177580307 -2.62759258
## 5  neuroticism:genderMale  0.001576252 0.015225987  0.10352377
## 6 neuroticism:genderOther -0.008332371 0.125204817 -0.06654992
##         p.value
## 1  0.000000e+00
## 2 4.504207e-205
## 3  7.423191e-09
## 4  8.605839e-03
## 5  9.175483e-01
## 6  9.469407e-01
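
In R's formula syntax, `neuroticism * gender` expands to both main effects plus their interaction, so the fitted model has the form below. Here \(\text{Male}\) and \(\text{Other}\) are 0/1 indicator variables, the gender level absent from the output is the reference category, and the \(\beta\) subscripts are labels chosen for this note (row number in the output minus one).

\[
\widehat{\text{extraversion}} = \beta_0 + \beta_1\,\text{neuroticism} + \beta_2\,\text{Male} + \beta_3\,\text{Other} + \beta_4\,(\text{neuroticism} \times \text{Male}) + \beta_5\,(\text{neuroticism} \times \text{Other})
\]

With the estimates above, the slope of extraversion on neuroticism is about 0.297 for the reference group and about \(0.297 + 0.002 = 0.299\) for males. The large p-values on the interaction terms (about 0.92 and 0.95) indicate that this difference in slopes is not distinguishable from zero.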

Plot extraversion vs. neuroticism and gender

ggplot(data = big5, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) +  # jitter only; adding geom_point() as well draws every point twice
  geom_smooth(method = "lm")

Suppose you want only teens

big5_teen <- filter(big5, age <= 19)
m_ext_age_teen <- lm(extraversion ~ age * gender, data = big5_teen)
tidy(m_ext_age_teen)
##              term   estimate   std.error  statistic      p.value
## 1     (Intercept) 14.1253643  1.43787893  9.8237508 1.261620e-22
## 2             age  0.3009079  0.08501684  3.5393913 4.037551e-04
## 3      genderMale  6.7870228  2.47559320  2.7415743 6.130681e-03
## 4     genderOther  6.6600580 11.01227787  0.6047848 5.453424e-01
## 5  age:genderMale -0.4206645  0.14590370 -2.8831655 3.949463e-03
## 6 age:genderOther -0.7617389  0.66364213 -1.1478158 2.510854e-01

Plot for only teens

ggplot(data = big5_teen, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) +  # jitter only; adding geom_point() as well draws every point twice
  geom_smooth(method = "lm")

3. Version control

What is version control?

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

Bad

Good

    2013-10-14_manuscriptFish.doc
    2013-10-30_manuscriptFish.doc
    2013-11-05_manuscriptFish_initialRyanEdits.doc
    2013-11-10_manuscriptFish.doc
    2013-11-11_manuscriptFish.doc
    2013-11-15_manuscriptFish.doc
    2013-11-30_manuscriptFish.doc
    2013-12-01_manuscriptFish.doc
    2013-12-02_manuscriptFish_PNASsubmitted.doc
    2014-01-03_manuscriptFish_PLOSsubmitted.doc
    2014-02-15_manuscriptFish_PLOSrevision.doc
    2014-03-14_manuscriptFish_PLOSpublished.doc

Better - Saving everything together at once

Every time you save, you zip the entire directory that holds your project files and store the archive under a dated name.

Best - Version Control

How does a version control system work?

  • Start with a base version of the document, save just the changes you made at each step of the way.

  • Think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.

  • You can also "play back" different sets of changes onto the base document and get different versions of the document.

Source: Software Carpentry.
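
The tape metaphor can be mimicked in a few lines of R. This is a toy illustration of the idea only, not a description of how any particular version control tool stores history; `base_doc`, `changes`, and `play_back()` are hypothetical names.

```r
# Base version of a hypothetical document, one element per line.
base_doc <- c("Title: Fish manuscript", "Methods: TBD")

# Each recorded change is a function that turns one version into the next.
changes <- list(
  add_results = function(doc) c(doc, "Results: TBD"),
  fill_methods = function(doc) {
    sub("Methods: TBD", "Methods: netting survey", doc, fixed = TRUE)
  },
  add_discussion = function(doc) c(doc, "Discussion: TBD")
)

# "Playing the tape": apply a sequence of changes, in order, to the base version.
play_back <- function(doc, steps) {
  Reduce(function(current, step) step(current), steps, doc)
}

latest <- play_back(base_doc, changes)  # all changes, in order
variant <- play_back(base_doc, changes[c("add_results", "add_discussion")])  # a different set
```

Replaying every change gives the latest version, while replaying only a subset gives a different version built from the same base.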

Git/GitHub

  • Easy to set up
  • Integrated with RStudio
  • GitHub's strong community: your colleagues are probably already there
  • Provides tools to help enhance collaboration
  • A common location to share your work

Commits

Diff

Parting remarks

Two-pronged approach

Everyone struggles with reproducibility, and irreproducible work is a hindrance to moving science forward.

#1 Adopt a reproducible research workflow



#2 Train new researchers who don’t have any other workflow


Resources